Method and apparatus for speech segmentation

ABSTRACT

Machine-readable media, methods, apparatus and system for speech segmentation are described. In some embodiments, a fuzzy rule may be determined to discriminate a speech segment from a non-speech segment. An antecedent of the fuzzy rule may include an input variable and an input variable membership. A consequent of the fuzzy rule may include an output variable and an output variable membership. An instance of the input variable may be extracted from a segment. An input variable membership function associated with the input variable membership and an output variable membership function associated with the output variable membership may be trained. The instance of the input variable, the input variable membership function, the output variable, and the output variable membership function may be operated, to determine whether the segment is the speech segment or the non-speech segment.

BACKGROUND

Speech segmentation may be a step of unstructured information retrieval to classify the unstructured information into speech segments and non-speech segments. Various methods may be applied for speech segmentation. The most commonly used method is to manually extract speech segments from a media resource, discriminating each speech segment from a non-speech segment.

BRIEF DESCRIPTION OF THE DRAWINGS

The invention described herein is illustrated by way of example and not by way of limitation in the accompanying figures. For simplicity and clarity of illustration, elements illustrated in the figures are not necessarily drawn to scale. For example, the dimensions of some elements may be exaggerated relative to other elements for clarity. Further, where considered appropriate, reference labels have been repeated among the figures to indicate corresponding or analogous elements.

FIG. 1 shows an embodiment of a computing platform that comprises a speech segmentation system.

FIG. 2 shows an embodiment of the speech segmentation system.

FIG. 3 shows an embodiment of a fuzzy rule and how the speech segmentation system operates the fuzzy rule to determine whether a segment is speech or not.

FIG. 4 shows an embodiment of a method of speech segmentation by the speech segmentation system.

DETAILED DESCRIPTION

The following description describes techniques for a method and apparatus for speech segmentation. In the following description, numerous specific details such as logic implementations, pseudo-code, means to specify operands, resource partitioning/sharing/duplication implementations, types and interrelationships of system components, and logic partitioning/integration choices are set forth in order to provide a more thorough understanding of the current invention. However, the invention may be practiced without such specific details. In other instances, control structures, gate level circuits and full software instruction sequences have not been shown in detail in order not to obscure the invention. Those of ordinary skill in the art, with the included descriptions, will be able to implement appropriate functionality without undue experimentation.

References in the specification to “one embodiment”, “an embodiment”, “an example embodiment”, etc., indicate that the embodiment described may include a particular feature, structure, or characteristic, but every embodiment may not necessarily include the particular feature, structure, or characteristic. Moreover, such phrases are not necessarily referring to the same embodiment. Further, when a particular feature, structure, or characteristic is described in connection with an embodiment, it is submitted that it is within the knowledge of one skilled in the art to effect such feature, structure, or characteristic in connection with other embodiments whether or not explicitly described.

Embodiments of the invention may be implemented in hardware, firmware, software, or any combination thereof. Embodiments of the invention may also be implemented as instructions stored on a machine-readable medium, that may be read and executed by one or more processors. A machine-readable medium may include any mechanism for storing or transmitting information in a form readable by a machine (e.g., a computing device). For example, a machine-readable medium may include read only memory (ROM); random access memory (RAM); magnetic disk storage media; optical storage media; flash memory devices; electrical, optical, acoustical or other forms of propagated signals (e.g., carrier waves, infrared signals, digital signals, etc.) and others.

An embodiment of a computing platform 10 comprising a speech segmentation system 121 is shown in FIG. 1. Examples for the computing platform may include mainframe computer, mini-computer, personal computer, portable computer, laptop computer and other devices for transceiving and processing data.

The computing platform 10 may comprise one or more processors 11, memory 12, chipset 13, I/O device 14 and possibly other components. The one or more processors 11 are communicatively coupled to various components (e.g., the memory 12) via one or more buses such as a processor bus. The processors 11 may be implemented as an integrated circuit (IC) with one or more processing cores that may execute codes. Examples for the processor 11 may include Intel® Core™, Intel® Celeron™, Intel® Pentium™, Intel® Xeon™, Intel® Itanium™ architectures, available from Intel Corporation of Santa Clara, Calif.

The memory 12 may store codes to be executed by the processor 11.

Examples for the memory 12 may comprise one or a combination of the following semiconductor devices, such as synchronous dynamic random access memory (SDRAM) devices, RAMBUS dynamic random access memory (RDRAM) devices, double data rate (DDR) memory devices, static random access memory (SRAM), and flash memory devices.

The chipset 13 may provide one or more communicative paths among the processor 11, the memory 12, the I/O devices 14 and possibly other components. The chipset 13 may further comprise hubs to respectively communicate with the above-mentioned components. For example, the chipset 13 may comprise a memory controller hub, an input/output controller hub and possibly other hubs.

The I/O devices 14 may input or output data to or from the computing platform 10, such as media data. Examples for the I/O devices 14 may comprise a network card, a Bluetooth device, an antenna, and possibly other devices for transceiving data.

In the embodiment as shown in FIG. 1, the memory 12 may further comprise codes implemented as a media resource 120, speech segmentation system 121, speech segments 122 and non-speech segments 123.

The media resource 120 may comprise audio resource and video resource. Media resource 120 may be provided by various components, such as the I/O devices 14, a disc storage (not shown), and an audio/video device (not shown).

The speech segmentation system 121 may split the media 120 into a number of media segments, determine if a media segment is a speech segment 122 or a non-speech segment 123, and label the media segment as the speech segment 122 or the non-speech segment 123. Speech segmentation may be useful in various scenarios. For example, speech classification and segmentation may be used for audio-text mapping. In this scenario, the speech segments 122 may go through an audio-text alignment so that a text mapping with the speech segment is selected.

The speech segmentation system 121 may use fuzzy inference technologies to discriminate the speech segment 122 from the non-speech segment 123. More details are provided in FIG. 2.

FIG. 2 illustrates an embodiment of the speech segmentation system 121. The speech segmentation system 121 may comprise a fuzzy rule 20, a media splitting logic 21, an input variable extracting logic 22, a membership function training logic 23, a fuzzy rule operating logic 24, a defuzzifying logic 25, a labeling logic 26, and possibly other components for speech segmentation.

Fuzzy rule 20 may store one or more fuzzy rules, which may be determined based upon various factors, such as characteristics of the media 120 and prior knowledge on speech data. The fuzzy rule may be a linguistic rule to determine whether a media segment is speech or non-speech and may take various forms, such as if-then form. An if-then rule may comprise an antecedent part (if) and a consequent part (then). The antecedent may specify conditions to gain the consequent.

The antecedent may comprise one or more input variables indicating various characteristics of media data. For example, the input variable may be selected from a group of features including a high zero-crossing rate ratio (HZCRR), a percentage of “low-energy” frames (LEFP), a variance of spectral centroid (SCV), a variance of spectral flux (SFV), a variance of spectral roll-off point (SRPV) and a 4 Hz modulation energy (4 Hz). The consequent may comprise an output variable. In the embodiment of FIG. 2, the output variable may be speech-likelihood.
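The specification names the input variables but does not state how they are computed. As a minimal sketch only, two of them might be computed as follows, assuming definitions commonly used in the audio classification literature: LEFP as the fraction of frames whose short-time energy falls below half of the mean energy of the segment, and HZCRR as the fraction of frames whose zero-crossing rate exceeds 1.5 times the mean zero-crossing rate. The function names, the non-overlapping framing, and the thresholds 0.5 and 1.5 are assumptions and are not taken from the embodiment.

```python
import numpy as np


def frame_signal(samples, frame_len):
    """Split a 1-D signal into non-overlapping frames, dropping any remainder."""
    n = len(samples) // frame_len
    return np.asarray(samples[: n * frame_len], dtype=float).reshape(n, frame_len)


def lefp(frames, ratio=0.5):
    """Fraction of frames with energy below ratio * mean energy (assumed definition)."""
    energy = np.mean(frames ** 2, axis=1)
    return float(np.mean(energy < ratio * energy.mean()))


def hzcrr(frames, ratio=1.5):
    """Fraction of frames with zero-crossing rate above ratio * mean rate (assumed definition)."""
    signs = np.signbit(frames)
    # A zero crossing is counted wherever two adjacent samples differ in sign.
    zcr = np.mean(signs[:, 1:] != signs[:, :-1], axis=1)
    return float(np.mean(zcr > ratio * zcr.mean()))
```

Both functions assume at least one frame in the segment and return a value between 0 and 1, which may then serve as an instance of the corresponding input variable.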

The following may be an example of the fuzzy rule used for a media under a high SNR (signal-to-noise ratio) environment.

Rule one: if LEFP is high or SFV is low, then speech-likelihood is speech; and

Rule two: if LEFP is low and HZCRR is high, then speech-likelihood is non-speech.

The following may be another example of the fuzzy rule used for a media under a low SNR environment.

Rule one: if HZCRR is low, then speech-likelihood is non-speech;

Rule two: if LEFP is high, then speech-likelihood is speech;

Rule three: if LEFP is low, then speech-likelihood is non-speech;

Rule four: if SCV is high and SFV is high and SRPV is high, then speech-likelihood is speech;

Rule five: if SCV is low and SFV is low and SRPV is low, then speech-likelihood is non-speech;

Rule six: if 4 Hz is very high, then speech-likelihood is speech; and

Rule seven: if 4 Hz is low, then speech-likelihood is non-speech.

Each statement of the rule may admit a possibility of a partial membership in it. In other words, each statement of the rule may be a matter of degree that the input variable or the output variable belongs to a membership. In the above-stated rules, each input variable may employ two membership functions defined as: “low” and “high”. The output variable may employ two membership functions defined as “speech” and “non-speech”. It should be appreciated that the fuzzy rule may associate different input variables with different membership functions. For example, input variable LEFP may employ “medium” and “low” membership functions, while input variable SFV may employ “high” and “medium” membership functions.

Membership function training logic 23 may train the membership functions associated with each input variable. The membership function may be formed in various patterns. For example, the simplest membership function may be formed as a straight line, a triangle or a trapezoid. Two further membership functions may be built on the Gaussian distribution curve: a simple Gaussian curve and a two-sided composite of two different Gaussian curves. A generalized bell membership function, which is specified by three parameters, may also be used.
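By way of illustration, the Gaussian-based and generalized bell patterns mentioned above might be written as follows. This is a sketch using the conventional textbook forms of these curves; the function names and the parameter names (sigma, c, a, b) are assumptions, and the embodiment does not specify how the parameters are trained.

```python
import numpy as np


def gaussmf(x, sigma, c):
    """Simple Gaussian curve centered at c with width sigma."""
    x = np.asarray(x, dtype=float)
    return np.exp(-((x - c) ** 2) / (2.0 * sigma ** 2))


def gauss2mf(x, sigma1, c1, sigma2, c2):
    """Two-sided composite of two Gaussian curves (assumes c1 <= c2).

    The left shoulder follows (sigma1, c1), the right shoulder follows
    (sigma2, c2), and the value is 1 between c1 and c2.
    """
    x = np.asarray(x, dtype=float)
    left = np.where(x < c1, gaussmf(x, sigma1, c1), 1.0)
    right = np.where(x > c2, gaussmf(x, sigma2, c2), 1.0)
    return left * right


def gbellmf(x, a, b, c):
    """Generalized bell curve with three parameters: width a, slope b, center c."""
    x = np.asarray(x, dtype=float)
    return 1.0 / (1.0 + np.abs((x - c) / a) ** (2 * b))
```

Each function maps an instance of a variable to a membership value between 0 and 1, with the value 1 reached at the center (or, for the two-sided composite, anywhere between the two centers).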

Media splitting logic 21 may split the media resource 120 into a number of media segments, for example, each media segment in a 1-second window. Input variable extracting logic 22 may extract instances of the input variables from each media segment based upon the fuzzy rule 20. Fuzzy rule operating logic 24 may operate the instances of the input variables, the membership functions associated with the input variables, the output variable and the membership function associated with the output variable based upon the fuzzy rule 20, to obtain an entire fuzzy conclusion that may represent possibilities that the output variable (i.e., speech-likelihood) belongs to a membership (i.e., speech or non-speech).
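As a minimal sketch of the splitting step, an audio resource held as a sequence of samples might be cut into 1-second windows as shown below. The function name, the sample-sequence representation, and the choice to keep a shorter final segment are assumptions made for illustration.

```python
def split_into_segments(samples, sample_rate, window_seconds=1.0):
    """Split a sample sequence into consecutive windows of window_seconds each.

    The last segment may be shorter than the others; keeping it is an
    assumption, since the embodiment does not say how a remainder is handled.
    """
    size = int(round(sample_rate * window_seconds))
    if size <= 0:
        raise ValueError("window must contain at least one sample")
    return [samples[i:i + size] for i in range(0, len(samples), size)]
```

For example, 2.5 seconds of audio sampled at 16000 Hz would yield two segments of 16000 samples and one segment of 8000 samples.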

Defuzzifying logic 25 may defuzzify the fuzzy conclusion from the fuzzy rule operating logic 24 to obtain a definite number of the output variable. A variety of methods may be applied for the defuzzification. For example, a weighted-centroid method may be used to find the centroid of a weighted aggregation of each output from each fuzzy rule. The centroid may identify the definite number of the output variable (i.e., the speech-likelihood).

Labeling logic 26 may label each media segment as a speech segment or a non-speech segment based upon the definite number of the speech-likelihood for this media segment.

FIG. 3 illustrates an embodiment of the fuzzy rule 20 and how the speech segmentation system 121 operates the fuzzy rule to determine whether a segment is speech or not. As illustrated, the fuzzy rule 20 may comprise two rules:

Rule one: if LEFP is high or SFV is low, then speech-likelihood is speech; and

Rule two: if LEFP is low and HZCRR is high, then speech-likelihood is non-speech.

Firstly, the fuzzy rule operating logic 24 may fuzzify each input variable of each rule based upon the extracted instances of the input variables and the membership functions. As stated above, each statement of the fuzzy rule may admit a possibility of partial membership in it and the truth of the statement may become a matter of degree. For example, the statement ‘LEFP is high’ may admit a partial degree that LEFP is high. The degree that LEFP belongs to the “high” membership may be denoted by a membership value between 0 and 1. The “high” membership function associated with LEFP as shown in the block B₀₀ of FIG. 3 may map a LEFP instance to its appropriate membership value. A process of utilizing the membership function associated with the input variable and the extracted instance of the input variable (e.g., LEFP=0.7, HZCRR=0.8, SFV=0.1) to obtain a membership value may be called “fuzzifying input”. Therefore, as shown in FIG. 3, the input variable “LEFP” of rule one may be fuzzified into the “high” membership value 0.4. Similarly, the input variable “SFV” of rule one may be fuzzified into the “low” membership value 0.8; the input variable “LEFP” of rule two may be fuzzified into the “low” membership value 0.1; and the input variable “HZCRR” may be fuzzified into the “high” membership value 0.5.
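The fuzzifying step may be sketched as the evaluation of a membership function at an extracted instance. The trained function of block B₀₀ is not reproduced here; the linear ramp below and its end points 0.5 and 1.0 are hypothetical, chosen only so that the instance LEFP=0.7 maps to a “high” membership value of about 0.4 as in the example above.

```python
def ramp_high(x, start, end):
    """Hypothetical 'high' membership function.

    Returns 0 at or below start, 1 at or above end, and rises linearly
    in between. This shape is an assumption, not the trained function.
    """
    if end <= start:
        raise ValueError("end must be greater than start")
    return min(1.0, max(0.0, (x - start) / (end - start)))


# Fuzzify the extracted instance LEFP = 0.7 (value is approximately 0.4).
lefp_high = ramp_high(0.7, start=0.5, end=1.0)
```

Any of the membership function patterns described earlier could be substituted for the ramp; only the mapping from an instance to a value between 0 and 1 matters for the subsequent steps.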

Secondly, the fuzzy rule operating logic 24 may operate the fuzzified inputs of each rule to obtain a fuzzified output of the rule. If the antecedent of the rule comprises more than one part, a fuzzy logical operator (e.g., AND, OR, NOT) may be used to obtain a value representing a result of the antecedent. For example, rule one may have two parts “LEFP is high” and “SFV is low”. Rule one may utilize the fuzzy logical operator “OR” to take a maximum value of the fuzzified inputs, i.e., the maximum value 0.8 of the fuzzified inputs 0.4 and 0.8, as the result of the antecedent of rule one. Rule two may have two other parts “LEFP is low” and “HZCRR is high”. Rule two may utilize the fuzzy logical operator “AND” to take a minimum value of the fuzzified inputs, i.e., the minimum value 0.1 of the fuzzified inputs 0.1 and 0.5, as the result of the antecedent of rule two.
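The antecedent evaluation of the two rules may be sketched with the maximum and minimum operators described above, using the fuzzified inputs of FIG. 3. The function names are hypothetical. The “NOT” operator is not defined in the description; the complement 1 - d shown here is the conventional choice and is an assumption.

```python
def fuzzy_or(*degrees):
    """Fuzzy OR: the maximum of the fuzzified inputs."""
    return max(degrees)


def fuzzy_and(*degrees):
    """Fuzzy AND: the minimum of the fuzzified inputs."""
    return min(degrees)


def fuzzy_not(degree):
    """Fuzzy NOT as the complement 1 - degree (assumed, not from the description)."""
    return 1.0 - degree


# Rule one: "LEFP is high" (0.4) OR "SFV is low" (0.8)
rule_one_strength = fuzzy_or(0.4, 0.8)   # 0.8

# Rule two: "LEFP is low" (0.1) AND "HZCRR is high" (0.5)
rule_two_strength = fuzzy_and(0.1, 0.5)  # 0.1
```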

Thirdly, for each rule, the fuzzy rule operating logic 24 may utilize a membership function associated with the output variable “speech-likelihood” and the result of the rule antecedent to obtain a set of membership values indicating a set of degrees that the speech-likelihood belongs to the membership (i.e., speech or non-speech). For rule one, the fuzzy rule operating logic 24 may apply an implication method to reshape the “speech” membership function by limiting the highest degree that the speech-likelihood belongs to “speech” membership to the value obtained from the antecedent of rule one, i.e., the value 0.8. Block B₀₄ of FIG. 3 shows a set of degrees that the speech-likelihood may belong to “speech” membership for rule one. Similarly, block B₁₄ of FIG. 3 shows another set of degrees that the speech-likelihood may belong to “non-speech” membership for rule two.
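The implication step may be sketched as clipping a sampled output membership function at the antecedent result. The sampling grid over the range 0 to 1 and the linear shape of the “speech” membership function are assumptions; the shape actually shown in FIG. 3 is not reproduced.

```python
import numpy as np

# Assumed universe of the output variable "speech-likelihood", sampled on a grid.
universe = np.linspace(0.0, 1.0, 101)

# Hypothetical "speech" membership function: rises linearly from 0 to 1.
speech_mf = universe.copy()


def truncate(membership, strength):
    """Limit the highest degree of a membership function to the antecedent result."""
    return np.minimum(membership, strength)


# Rule one: the antecedent result 0.8 caps the "speech" membership function.
rule_one_output = truncate(speech_mf, 0.8)
```

The resulting array is the fuzzy set for rule one: it follows the original function wherever that function is below 0.8 and is flat at 0.8 elsewhere.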

Fourthly, the defuzzifying logic 25 may defuzzify the output of each rule to obtain a defuzzified value of the output variable “speech-likelihood”. The output from each rule may be an entire fuzzy set that may represent degrees that the output variable “speech-likelihood” belongs to a membership. A process of obtaining an absolute value of the output is called “defuzzification”. A variety of methods may be applied for the defuzzification. For example, the defuzzifying logic 25 may obtain the absolute value of the output by utilizing the above-stated weighted-centroid method.

More specifically, the defuzzifying logic 25 may assign a weight to each output of each rule, such as the set of degrees as shown in block B₀₄ of FIG. 3 and the set of degrees as shown in block B₁₄ of FIG. 3. For example, the defuzzifying logic 25 may assign weight “1” to the output of rule one and the output of rule two. Then, the defuzzifying logic 25 may aggregate the weighted outputs and obtain a union that may define a range of output values. Block B₂₀ of FIG. 3 may show the result of the aggregation. Finally, the defuzzifying logic 25 may find a centroid of the aggregation as the absolute value of the output “speech-likelihood”. As shown in FIG. 3, the speech-likelihood value may be 0.8, upon which the speech segmentation system 121 may determine whether the media segment is speech or non-speech.
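The weighting, aggregation and centroid steps may be sketched as follows for rule outputs sampled on a common, evenly spaced grid. Taking the union as a pointwise maximum and approximating the centroid by a discrete weighted mean are assumptions, as are the function name and the linear output membership functions used in the usage lines; the sketch therefore does not reproduce the value 0.8 of FIG. 3.

```python
import numpy as np


def defuzzify_centroid(universe, rule_outputs, weights=None):
    """Weight each rule output, take their union, and return its centroid.

    universe: evenly spaced grid of output values, shape (n,)
    rule_outputs: one truncated membership array per rule, each of shape (n,)
    weights: one weight per rule; defaults to 1 for every rule
    """
    outputs = np.asarray(rule_outputs, dtype=float)
    if weights is None:
        weights = np.ones(len(outputs))
    weighted = outputs * np.asarray(weights, dtype=float)[:, None]
    union = weighted.max(axis=0)  # pointwise maximum as the union (assumed)
    total = union.sum()
    if total == 0.0:
        raise ValueError("aggregated output is empty; centroid is undefined")
    return float((universe * union).sum() / total)


# Usage with hypothetical linear output membership functions.
u = np.linspace(0.0, 1.0, 101)
speech_out = np.minimum(u, 0.8)            # rule one, truncated at 0.8
non_speech_out = np.minimum(1.0 - u, 0.1)  # rule two, truncated at 0.1
likelihood = defuzzify_centroid(u, [speech_out, non_speech_out])
```

With rule one dominating, the centroid falls in the upper half of the range, which is consistent with a segment that is more likely speech than non-speech.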

FIG. 4 shows an embodiment of a method of speech segmentation by the speech segmentation system 121. In block 401, the media splitting logic 21 may split the media 120 into a number of media segments, for example, each media segment in a 1-second window. In block 402, the fuzzy rule 20 may comprise one or more rules that may specify conditions of determining whether a media segment is speech or non-speech. The fuzzy rules may be determined based upon characteristics of the media 120 and prior knowledge on speech data.

In block 403, the membership function training logic 23 may train membership functions associated with each input variable of each fuzzy rule. The membership function training logic 23 may further train membership functions associated with the output variable “speech-likelihood” of the fuzzy rule. In block 404, the input variable extracting logic 22 may extract the input variable from each media segment according to the antecedent of each fuzzy rule. In block 405, the fuzzy rule operating logic 24 may fuzzify each input variable of each fuzzy rule by utilizing the extracted instance of the input variable and the membership function associated with the input variable.

In block 406, the fuzzy rule operating logic 24 may obtain a value representing a result of the antecedent. If the antecedent comprises one part, then the fuzzified input from that part may be the value. If the antecedent comprises more than one part, the fuzzy rule operating logic 24 may obtain the value by operating each fuzzified input from each part with a fuzzy logic operator, e.g., AND, OR or NOT, as denoted by the fuzzy rule. In block 407, the fuzzy rule operating logic 24 may apply an implication method to truncate the membership function associated with the output variable of each fuzzy rule. The truncated membership function may define a range of degrees that the output variable belongs to the membership.

In block 408, the defuzzifying logic 25 may assign a weight to each output from each fuzzy rule and aggregate the weighted output to obtain an output union. In block 409, the defuzzifying logic 25 may apply a centroid method to find a centroid of the output union as a value of the output variable “speech-likelihood”. In block 410, the labeling logic 26 may label whether the media segment is speech or non-speech based upon the speech-likelihood value.
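Blocks 405 to 410 may be sketched end to end for the two-rule, high SNR rule set. Everything not stated in the description is an assumption: the linear membership functions stand in for trained ones, both rule weights are 1, the neutral value 0.5 is returned when no rule fires, and the labeling threshold 0.5 is hypothetical because the description does not state how the speech-likelihood value is mapped to a label.

```python
import numpy as np

UNIVERSE = np.linspace(0.0, 1.0, 101)  # assumed range of speech-likelihood


def high(x):
    """Hypothetical 'high' (and 'speech') membership: linear from 0 to 1."""
    return np.clip(x, 0.0, 1.0)


def low(x):
    """Hypothetical 'low' (and 'non-speech') membership: linear from 1 to 0."""
    return 1.0 - high(x)


def speech_likelihood(lefp, sfv, hzcrr):
    # Blocks 405-406: fuzzify the inputs and combine them per rule.
    s1 = max(high(lefp), low(sfv))    # rule one: LEFP is high OR SFV is low
    s2 = min(low(lefp), high(hzcrr))  # rule two: LEFP is low AND HZCRR is high
    # Block 407: truncate each output membership function.
    out1 = np.minimum(high(UNIVERSE), s1)  # "speech"
    out2 = np.minimum(low(UNIVERSE), s2)   # "non-speech"
    # Block 408: weights of 1, union taken as a pointwise maximum.
    union = np.maximum(out1, out2)
    # Block 409: discrete centroid of the output union.
    total = union.sum()
    if total == 0.0:
        return 0.5  # no rule fired; neutral value is an assumption
    return float((UNIVERSE * union).sum() / total)


def label_segment(lefp, sfv, hzcrr, threshold=0.5):
    # Block 410: the threshold is hypothetical.
    if speech_likelihood(lefp, sfv, hzcrr) >= threshold:
        return "speech"
    return "non-speech"
```

A segment with a high LEFP and a low SFV, such as `label_segment(0.9, 0.1, 0.2)`, would be labeled as speech under these assumptions, while the mirrored inputs `label_segment(0.1, 0.9, 0.9)` would be labeled as non-speech.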

While certain features of the invention have been described with reference to example embodiments, the description is not intended to be construed in a limiting sense. Various modifications of the example embodiments, as well as other embodiments of the invention, which are apparent to persons skilled in the art to which the invention pertains are deemed to lie within the spirit and scope of the invention.

1. A method, comprising: determining a fuzzy rule to discriminate a speech segment from a non-speech segment, wherein an antecedent of the fuzzy rule includes an input variable and an input variable membership, and wherein a consequent of the fuzzy rule includes an output variable and an output variable membership; extracting an instance of the input variable from a segment; training an input variable membership function associated with the input variable membership and an output variable membership function associated with the output variable membership; and operating the instance of the input variable, the input variable membership function, the output variable, and the output variable membership function, to determine whether the segment is the speech segment or the non-speech segment.
2. The method of claim 1, wherein the antecedent admits a first partial degree that the input variable belongs to the input variable membership.
3. The method of claim 1, wherein the consequent admits a second partial degree that the output variable belongs to the output variable membership.
4. The method of claim 1, wherein the input variable comprises at least one variable selected from a group of percentage of low-energy frames (LEFP), high zero-crossing rate ratio (HZCRR), variance of spectral centroid (SCV), variance of spectral flux (SFV), variance of spectral roll-off point (SRPV) and 4 Hz modulation energy (4 Hz).
5. The method of claim 4, wherein the output variable is speech-likelihood.
6. The method of claim 5, wherein the fuzzy rule comprises: a first rule stating that if LEFP is high or SFV is low, then the speech-likelihood is speech; and a second rule stating that if LEFP is low and HZCRR is high, then the speech-likelihood is non-speech.
7. The method of claim 5, wherein the fuzzy rule comprises: a first rule stating that if HZCRR is low, then the speech-likelihood is non-speech; a second rule stating that if LEFP is high, then the speech-likelihood is speech; a third rule stating that if LEFP is low, then the speech-likelihood is non-speech; a fourth rule stating that if SCV is high and SFV is high and SRPV is high, then the speech-likelihood is speech; a fifth rule stating that if SCV is low and SFV is low and SRPV is low, then the speech-likelihood is non-speech; a sixth rule stating that if 4 Hz is high, then the speech-likelihood is speech; and a seventh rule stating that if 4 Hz is low, then the speech-likelihood is non-speech.
8. The method of claim 1, wherein the operating further comprises: fuzzifying the input variable based upon the instance of the input variable and the input variable membership function, to provide a fuzzified input indicating a first degree that the input variable belongs to the input variable membership; reshaping the output variable membership function based upon the fuzzified input, to provide an output set indicating a group of second degrees that the output variable belongs to the output variable membership; defuzzifying the output set to provide a defuzzified output; and labeling whether the segment is the speech segment or the non-speech segment based upon the defuzzified output.
9. The method of claim 8, wherein the defuzzifying further comprises: if the fuzzy rule comprises one rule, then finding a centroid of the output set to provide the defuzzified output; if the fuzzy rule comprises a plurality of rules, then multiplying each of a plurality of weights with the output set obtained through each of the plurality of rules, to provide each of a plurality of weighted output sets; aggregating the plurality of weighted output sets to provide an output union; and finding a centroid of the output union to provide the defuzzified output.
10. A machine-readable medium comprising a plurality of instructions which when executed result in a system: determining a fuzzy rule to discriminate a speech segment from a non-speech segment, wherein an antecedent of the fuzzy rule includes an input variable and an input variable membership, and wherein a consequent of the fuzzy rule includes an output variable and an output variable membership; extracting an instance of the input variable from a segment; training an input variable membership function associated with the input variable membership and an output variable membership function associated with the output variable membership; and operating the instance of the input variable, the input variable membership function, the output variable, and the output variable membership function, to determine whether the segment is the speech segment or the non-speech segment.
11. The machine readable medium of claim 10, wherein the antecedent admits a first partial degree that the input variable belongs to the input variable membership.
12. The machine readable medium of claim 10, wherein the consequent admits a second partial degree that the output variable belongs to the output variable membership.
13. The machine readable medium of claim 10, wherein the input variable comprises at least one variable selected from a group of percentage of low-energy frames (LEFP), high zero-crossing rate ratio (HZCRR), variance of spectral centroid (SCV), variance of spectral flux (SFV), variance of spectral roll-off point (SRPV) and 4 Hz modulation energy (4 Hz).
14. The machine readable medium of claim 13, wherein the output variable is speech-likelihood.
15. The machine readable medium of claim 14, wherein the fuzzy rule comprises: a first rule stating that if LEFP is high or SFV is low, then the speech-likelihood is speech; and a second rule stating that if LEFP is low and HZCRR is high, then the speech-likelihood is non-speech.
16. The machine readable medium of claim 14, wherein the fuzzy rule comprises: a first rule stating that if HZCRR is low, then the speech-likelihood is non-speech; a second rule stating that if LEFP is high, then the speech-likelihood is speech; a third rule stating that if LEFP is low, then the speech-likelihood is non-speech; a fourth rule stating that if SCV is high and SFV is high and SRPV is high, then the speech-likelihood is speech; a fifth rule stating that if SCV is low and SFV is low and SRPV is low, then the speech-likelihood is non-speech; a sixth rule stating that if 4 Hz is high, then the speech-likelihood is speech; and a seventh rule stating that if 4 Hz is low, then the speech-likelihood is non-speech.
17. The machine readable medium of claim 10, wherein the plurality of instructions that result in the system operating, further result in the system: fuzzifying the input variable based upon the instance of the input variable and the input variable membership function, to provide a fuzzified input indicating a first degree that the input variable belongs to the input variable membership; reshaping the output variable membership function based upon the fuzzified input, to provide an output set indicating a group of second degrees that the output variable belongs to the output variable membership; defuzzifying the output set to provide a defuzzified output; and labeling whether the segment is the speech segment or the non-speech segment based upon the defuzzified output.
18. The machine readable medium of claim 17, wherein the plurality of instructions that result in the system defuzzifying, further result in the system: if the fuzzy rule comprises one rule, then finding a centroid of the output set to provide the defuzzified output; if the fuzzy rule comprises a plurality of rules, then multiplying each of a plurality of weights with the output set obtained through each of the plurality of rules, to provide each of a plurality of weighted output sets; aggregating the plurality of weighted output sets to provide an output union; and finding a centroid of the output union to provide the defuzzified output.